Abstract 

Non-convex sparsity-inducing penalties have 
recently received considerable attention in 
sparse learning. Recent theoretical inves- 
tigations have demonstrated their superior- 
ity over the convex counterparts in several 
sparse learning settings. However, solving 
the non-convex optimization problems asso- 
ciated with non-convex penalties remains a 
big challenge. A commonly used approach is 
the Multi-Stage (MS) convex relaxation (or 
DC programming), which relaxes the original 
non-convex problem to a sequence of convex 
problems. This approach is usually not very 
practical for large-scale problems because its 
computational cost is a multiple of solving 
a single convex problem. In this paper, we 
propose a General Iterative Shrinkage and 
Thresholding (GIST) algorithm to solve the 
nonconvex optimization problem for a large 
class of non-convex penalties. The GIST al- 
gorithm iteratively solves a proximal opera- 
tor problem, which in turn has a closed-form 
solution for many commonly used penalties. 
At each outer iteration of the algorithm, we 
use a line search initialized by the Barzilai-Borwein 
(BB) rule, which allows an appropriate step size 
to be found quickly. The paper also presents a 
detailed convergence analysis of the GIST algorithm. 
The efficiency of the proposed algorithm is 
demonstrated by extensive experiments on 
large-scale data sets. 



1. Introduction 

Learning sparse representations has important appli- 
cations in many areas of science and engineering. 
The use of an $\ell_0$-norm regularizer leads to a sparse 
solution; however, the $\ell_0$-norm regularized optimization 
problem is challenging to solve, due to the discontinuity 
and non-convexity of the $\ell_0$-norm regularizer. 
The $\ell_1$-norm regularizer, a continuous and 
convex surrogate, has been studied extensively in 
the literature (Tibshirani, 1996; Efron et al., 2004) 
and has been applied successfully to many applications 
including signal/image processing, biomedical informatics 
and computer vision (Shevade & Keerthi, 
2003; Wright et al., 2008; Beck & Teboulle, 2009; 
Wright et al., 2009; Ye & Liu, 2012). Although the 
$\ell_1$-norm based sparse learning formulations have achieved 
great success, they have been shown to be suboptimal 
in many cases (Candes et al., 2008; Zhang, 2010b; 
2012), since the $\ell_1$-norm is a loose approximation 
of the $\ell_0$-norm and often leads to an over-penalized 
problem. To address this issue, many non-convex 
regularizers, interpolated between the $\ell_0$-norm and 
the $\ell_1$-norm, have been proposed to better approximate 
the $\ell_0$-norm. They include the $\ell_q$-norm ($0 < q < 1$) 
(Foucart & Lai, 2009), Smoothly Clipped Absolute 
Deviation (SCAD) (Fan & Li, 2001), Log-Sum 
Penalty (LSP) (Candes et al., 2008), Minimax Concave 
Penalty (MCP) (Zhang, 2010a), Geman Penalty 
(GP) (Geman & Yang, 1995; Trzasko & Manduca, 
2009) and Capped-$\ell_1$ penalty (Zhang, 2010b; 2012; 
Gong et al., 2012a). 

Although the non-convex regularizers (penalties) are 
appealing in sparse learning, it is challenging to solve 
the corresponding non-convex optimization problems. 
In this paper, we propose a General Iterative Shrinkage 
and Thresholding (GIST) algorithm for a large class 
of non-convex penalties. The key step of the proposed 
algorithm is to compute a proximal operator, which 
has a closed-form solution for many commonly used 
non-convex penalties. In our algorithm, we adopt the 
Barzilai-Borwein (BB) rule (Barzilai & Borwein, 1988) 
to initialize the line search step size at each iteration, 
which greatly accelerates the convergence speed. We 
also use a non-monotone line search criterion to further 
speed up the convergence of the algorithm. In addi- 
tion, we present a detailed convergence analysis for the 
proposed algorithm. Extensive experiments on large- 
scale real-world data sets demonstrate the efficiency of 
the proposed algorithm. 

2. The Proposed Algorithm: GIST 
2.1. General Problems 

We consider solving the following general problem: 

$$\min_{\mathbf{w} \in \mathbb{R}^d} \{ f(\mathbf{w}) = l(\mathbf{w}) + r(\mathbf{w}) \}. \qquad (1)$$

We make the following assumptions on the above for- 
mulation throughout the paper: 

A1 $l(\mathbf{w})$ is continuously differentiable with Lipschitz 
continuous gradient, that is, there exists a positive 
constant $\beta(l)$ such that 

$$\|\nabla l(\mathbf{w}) - \nabla l(\mathbf{u})\| \le \beta(l) \|\mathbf{w} - \mathbf{u}\|, \quad \forall \mathbf{w}, \mathbf{u} \in \mathbb{R}^d.$$

A2 $r(\mathbf{w})$ is a continuous function which is possibly 
non-smooth and non-convex, and can be rewritten 
as the difference of two convex functions, that is, 

$$r(\mathbf{w}) = r_1(\mathbf{w}) - r_2(\mathbf{w}),$$

where $r_1(\mathbf{w})$ and $r_2(\mathbf{w})$ are convex functions. 

A3 $f(\mathbf{w})$ is bounded from below. 



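Remark 1 We say that $\mathbf{w}^*$ is a critical point of 
problem (1), if the following holds (Toland, 1979; 
Wright et al., 2009): 

$$0 \in \nabla l(\mathbf{w}^*) + \partial r_1(\mathbf{w}^*) - \partial r_2(\mathbf{w}^*),$$

where $\partial r_1(\mathbf{w}^*)$ is the sub-differential of the function 
$r_1(\mathbf{w})$ at $\mathbf{w} = \mathbf{w}^*$, that is, 

$$\partial r_1(\mathbf{w}^*) = \{ \mathbf{s} : r_1(\mathbf{w}) \ge r_1(\mathbf{w}^*) + \langle \mathbf{s}, \mathbf{w} - \mathbf{w}^* \rangle, \ \forall \mathbf{w} \in \mathbb{R}^d \}.$$

We should mention that the sub-differential is non-empty 
for any convex function; this is why we make 
the assumption that $r(\mathbf{w})$ can be rewritten as the 
difference of two convex functions. 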

2.2. Some Examples 

Many formulations in machine learning satisfy the 
assumptions above. The following least squares and 
logistic loss functions are two commonly used ones which 
satisfy assumption A1: 

$$l(\mathbf{w}) = \frac{1}{2n} \|X\mathbf{w} - \mathbf{y}\|^2 \quad \text{or} \quad \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \mathbf{x}_i^T \mathbf{w})\right),$$

where $X = [\mathbf{x}_1^T; \cdots; \mathbf{x}_n^T] \in \mathbb{R}^{n \times d}$ is a data matrix and 
$\mathbf{y} = [y_1, \cdots, y_n]^T \in \mathbb{R}^n$ is a target vector. The 
regularizers (penalties) which satisfy the assumption A2 are 
presented in Table 1. They are non-convex (except the 
$\ell_1$-norm) and extensively used in sparse learning. The 
functions $l(\mathbf{w})$ and $r(\mathbf{w})$ mentioned above are 
nonnegative. Hence, $f$ is bounded from below and satisfies 
assumption A3. 

2.3. Algorithm 

Our proposed General Iterative Shrinkage and Thresholding 
(GIST) algorithm solves problem (1) by generating 
a sequence $\{\mathbf{w}^{(k)}\}$ via: 

$$\mathbf{w}^{(k+1)} = \arg\min_{\mathbf{w}} \ l(\mathbf{w}^{(k)}) + \langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{t^{(k)}}{2} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2 + r(\mathbf{w}). \qquad (2)$$

In fact, problem (2) is equivalent to the following 
proximal operator problem: 

$$\mathbf{w}^{(k+1)} = \arg\min_{\mathbf{w}} \ \frac{1}{2} \|\mathbf{w} - \mathbf{u}^{(k)}\|^2 + \frac{1}{t^{(k)}} r(\mathbf{w}),$$

where $\mathbf{u}^{(k)} = \mathbf{w}^{(k)} - \nabla l(\mathbf{w}^{(k)}) / t^{(k)}$. Thus, in GIST 
we first perform a gradient descent step along the direction 
$-\nabla l(\mathbf{w}^{(k)})$ with step size $1/t^{(k)}$ and then solve a 
proximal operator problem. For all the regularizers listed 
in Table 1, problem (2) has a closed-form solution 
(details are provided in the Appendix), although it may 
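Table 1. Examples of regularizers (penalties) $r(\mathbf{w})$ satisfying the assumption A2 and the corresponding convex 
functions $r_1(\mathbf{w})$ and $r_2(\mathbf{w})$. $\lambda > 0$ is the regularization parameter; $r(\mathbf{w}) = \sum_i r_i(w_i)$, $r_1(\mathbf{w}) = \sum_i r_{1,i}(w_i)$, 
$r_2(\mathbf{w}) = \sum_i r_{2,i}(w_i)$, $[x]_+ = \max(0, x)$. 

| Name | $r_i(w_i)$ | $r_{1,i}(w_i)$ | $r_{2,i}(w_i)$ |
|---|---|---|---|
| $\ell_1$-norm | $\lambda \lvert w_i \rvert$ | $\lambda \lvert w_i \rvert$ | $0$ |
| LSP | $\lambda \log(1 + \lvert w_i \rvert / \theta)$ ($\theta > 0$) | $\lambda \lvert w_i \rvert$ | $\lambda (\lvert w_i \rvert - \log(1 + \lvert w_i \rvert / \theta))$ |
| SCAD | $\lambda \int_0^{\lvert w_i \rvert} \min\left(1, \frac{[\theta\lambda - x]_+}{(\theta - 1)\lambda}\right) dx$ ($\theta > 2$) $= \begin{cases} \lambda \lvert w_i \rvert, & \text{if } \lvert w_i \rvert \le \lambda, \\ \frac{-w_i^2 + 2\theta\lambda \lvert w_i \rvert - \lambda^2}{2(\theta - 1)}, & \text{if } \lambda < \lvert w_i \rvert \le \theta\lambda, \\ (\theta + 1)\lambda^2 / 2, & \text{if } \lvert w_i \rvert > \theta\lambda. \end{cases}$ | $\lambda \lvert w_i \rvert$ | $\lambda \int_0^{\lvert w_i \rvert} \frac{[\min(\theta\lambda, x) - \lambda]_+}{(\theta - 1)\lambda} dx = \begin{cases} 0, & \text{if } \lvert w_i \rvert \le \lambda, \\ \frac{w_i^2 - 2\lambda \lvert w_i \rvert + \lambda^2}{2(\theta - 1)}, & \text{if } \lambda < \lvert w_i \rvert \le \theta\lambda, \\ \lambda \lvert w_i \rvert - \frac{(\theta + 1)\lambda^2}{2}, & \text{if } \lvert w_i \rvert > \theta\lambda. \end{cases}$ |
| MCP | $\lambda \int_0^{\lvert w_i \rvert} \left[1 - \frac{x}{\theta\lambda}\right]_+ dx$ ($\theta > 0$) $= \begin{cases} \lambda \lvert w_i \rvert - w_i^2 / (2\theta), & \text{if } \lvert w_i \rvert \le \theta\lambda, \\ \theta\lambda^2 / 2, & \text{if } \lvert w_i \rvert > \theta\lambda. \end{cases}$ | $\lambda \lvert w_i \rvert$ | $\lambda \int_0^{\lvert w_i \rvert} \min(1, x / (\theta\lambda)) dx = \begin{cases} w_i^2 / (2\theta), & \text{if } \lvert w_i \rvert \le \theta\lambda, \\ \lambda \lvert w_i \rvert - \theta\lambda^2 / 2, & \text{if } \lvert w_i \rvert > \theta\lambda. \end{cases}$ |
| Capped $\ell_1$ | $\lambda \min(\lvert w_i \rvert, \theta)$ ($\theta > 0$) | $\lambda \lvert w_i \rvert$ | $\lambda [\lvert w_i \rvert - \theta]_+$ |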

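Algorithm 1 GIST: General Iterative Shrinkage and 
Thresholding Algorithm 
 1: Choose parameters $\eta > 1$ and $t_{\min}, t_{\max}$ with $0 < t_{\min} < t_{\max}$; 
 2: Initialize iteration counter $k \leftarrow 0$ and a bounded starting point $\mathbf{w}^{(0)}$; 
 3: repeat 
 4:   $t^{(k)} \in [t_{\min}, t_{\max}]$; 
 5:   repeat 
 6:     $\mathbf{w}^{(k+1)} \leftarrow \arg\min_{\mathbf{w}} \ l(\mathbf{w}^{(k)}) + \langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{t^{(k)}}{2} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2 + r(\mathbf{w})$; 
 7:     $t^{(k)} \leftarrow \eta t^{(k)}$; 
 8:   until some line search criterion is satisfied 
 9:   $k \leftarrow k + 1$ 
10: until some stopping criterion is satisfied 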



be a non-convex problem. For example, for the $\ell_1$ and 
Capped $\ell_1$ regularizers, we have closed-form solutions 
as follows: 

$$\ell_1: \quad w_i^{(k+1)} = \mathrm{sign}(u_i^{(k)}) \max\left(0, |u_i^{(k)}| - \lambda / t^{(k)}\right),$$

$$\text{Capped } \ell_1: \quad w_i^{(k+1)} = \begin{cases} x_1, & \text{if } h_i(x_1) \le h_i(x_2), \\ x_2, & \text{otherwise}, \end{cases}$$

where $x_1 = \mathrm{sign}(u_i^{(k)}) \max(|u_i^{(k)}|, \theta)$, 
$x_2 = \mathrm{sign}(u_i^{(k)}) \min(\theta, [|u_i^{(k)}| - \lambda / t^{(k)}]_+)$ and 
$h_i(x) = 0.5 (x - u_i^{(k)})^2 + \lambda / t^{(k)} \min(|x|, \theta)$. The detailed 
procedure of the GIST algorithm is presented in 
Algorithm 1. There are two issues that remain to be 
addressed: how to initialize $t^{(k)}$ (in Line 4) and how 
to select a line search criterion (in Line 8) at each 
outer iteration. 
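The two closed-form solutions above can be sketched in a few lines of Python. This is a minimal scalar illustration, not the authors' Matlab code; the function names `prox_l1` and `prox_capped_l1` and the argument order are assumptions made here, with `u`, `lam`, `theta` and `t` standing for $u_i^{(k)}$, $\lambda$, $\theta$ and $t^{(k)}$.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def prox_l1(u, lam, t):
    """Soft-thresholding: argmin_w 0.5*(w - u)**2 + (lam/t)*|w|."""
    return sign(u) * max(0.0, abs(u) - lam / t)


def prox_capped_l1(u, lam, theta, t):
    """argmin_w 0.5*(w - u)**2 + (lam/t)*min(|w|, theta).

    Compares the best point with |w| >= theta (x1) against the best
    point with |w| <= theta (x2) and keeps the one with smaller h.
    """
    def h(x):
        return 0.5 * (x - u) ** 2 + (lam / t) * min(abs(x), theta)

    x1 = sign(u) * max(abs(u), theta)
    x2 = sign(u) * min(theta, max(0.0, abs(u) - lam / t))
    return x1 if h(x1) <= h(x2) else x2
```

For a large entry such as `u = 3.0` with `theta = 1.0`, the capped penalty is flat beyond the cap, so the sketch returns `u` unchanged, whereas the $\ell_1$ solution still shrinks it by `lam / t`.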

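2.3.1. The Step Size Initialization: $1/t^{(k)}$ 

Intuitively, a good step size initialization strategy at 
each outer iteration can greatly reduce the line search 
cost (Lines 5-8) and hence is critical for the fast 
convergence of the algorithm. In this paper, we propose 
to initialize the step size by adopting the Barzilai-Borwein 
(BB) rule (Barzilai & Borwein, 1988), which 
uses a diagonal matrix $t^{(k)} I$ to approximate the Hessian 
matrix $\nabla^2 l(\mathbf{w})$ at $\mathbf{w} = \mathbf{w}^{(k)}$. Denote 

$$\mathbf{x}^{(k)} = \mathbf{w}^{(k)} - \mathbf{w}^{(k-1)}, \quad \mathbf{y}^{(k)} = \nabla l(\mathbf{w}^{(k)}) - \nabla l(\mathbf{w}^{(k-1)}).$$

Then $t^{(k)}$ is initialized at the outer iteration $k$ as 

$$t^{(k)} = \arg\min_t \|t \mathbf{x}^{(k)} - \mathbf{y}^{(k)}\|^2 = \frac{\langle \mathbf{x}^{(k)}, \mathbf{y}^{(k)} \rangle}{\langle \mathbf{x}^{(k)}, \mathbf{x}^{(k)} \rangle}.$$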
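As a rough illustration of the BB initialization, the following Python sketch computes $\langle \mathbf{x}^{(k)}, \mathbf{y}^{(k)} \rangle / \langle \mathbf{x}^{(k)}, \mathbf{x}^{(k)} \rangle$ from two consecutive iterates and gradients. The function name `bb_init`, the clipping to $[t_{\min}, t_{\max}]$ (suggested by Line 4 of Algorithm 1) and the `fallback` value used when the iterates coincide are assumptions of this sketch.

```python
def bb_init(w, w_prev, g, g_prev, t_min, t_max, fallback=1.0):
    """Barzilai-Borwein value <x, y> / <x, x>, clipped to [t_min, t_max].

    w, w_prev: current and previous iterates (lists of floats)
    g, g_prev: gradients of the loss at w and w_prev
    fallback:  hypothetical value returned when w == w_prev
    """
    x = [a - b for a, b in zip(w, w_prev)]
    y = [a - b for a, b in zip(g, g_prev)]
    xx = sum(a * a for a in x)
    if xx == 0.0:
        return fallback
    t = sum(a * b for a, b in zip(x, y)) / xx
    return min(t_max, max(t_min, t))
```

For a quadratic loss with Hessian diag(1, 3) and a step along (1, 1), the rule returns 2, the curvature averaged along that direction.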



2.3.2. Line Search Criterion 

One natural and commonly used line search criterion 
is to require that the objective function value is 
monotonically decreasing. More specifically, we propose to 
accept the step size $1/t^{(k)}$ at the outer iteration $k$ if the 
following monotone line search criterion is satisfied: 

$$f(\mathbf{w}^{(k+1)}) \le f(\mathbf{w}^{(k)}) - \frac{\sigma}{2} t^{(k)} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2, \qquad (3)$$

where $\sigma$ is a constant in the interval (0, 1). 

A variant of the monotone criterion in Eq. (3) is a 
non-monotone line search criterion (Grippo et al., 1986; 
Grippo & Sciandrone, 2002; Wright et al., 2009). It 
possibly accepts the step size $1/t^{(k)}$ even if $\mathbf{w}^{(k+1)}$ 
yields a larger objective function value than $\mathbf{w}^{(k)}$. 
Specifically, we propose to accept the step size $1/t^{(k)}$ 
if $\mathbf{w}^{(k+1)}$ makes the objective function value smaller 
than the maximum over previous $m$ ($m > 1$) iterations, 
that is, 

$$f(\mathbf{w}^{(k+1)}) \le \max_{i = \max(0, k-m+1), \cdots, k} f(\mathbf{w}^{(i)}) - \frac{\sigma}{2} t^{(k)} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2, \qquad (4)$$

where $\sigma \in (0, 1)$. 
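Both criteria can be expressed as one acceptance test, sketched below in Python. The function name `accept_step` and the convention of passing the whole history of objective values are assumptions of this sketch; with `m = 1` the test reduces to the monotone criterion in Eq. (3), and with `m > 1` it is the non-monotone criterion in Eq. (4).

```python
def accept_step(f_new, f_history, t, step_sq, sigma, m=1):
    """Line search acceptance test for a trial point.

    f_new:     objective value at the trial point w^(k+1)
    f_history: objective values f(w^(0)), ..., f(w^(k))
    t:         the value t^(k) used to compute the trial point
    step_sq:   squared distance ||w^(k+1) - w^(k)||^2
    sigma:     constant in (0, 1)
    m:         number of previous values in the reference maximum
    """
    reference = max(f_history[-m:])  # covers i = max(0, k-m+1), ..., k
    return f_new <= reference - 0.5 * sigma * t * step_sq
```

A trial value that is worse than the last iterate can therefore be rejected with `m = 1` yet accepted with a larger `m`, provided it still improves on the worst of the recent values.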
2.3.3. Convergence Analysis 

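Inspired by Wright et al. (2009) and Lu (2012a), we 
present a detailed convergence analysis under both 
monotone and non-monotone line search criteria. We 
first present a lemma which guarantees that the 
monotone line search criterion in Eq. (3) is satisfied. This 
is a basic support for the convergence of Algorithm 1. 

Lemma 1 Let the assumptions A1-A3 hold and the 
constant $\sigma \in (0, 1)$ be given. Then for any integer 
$k \ge 0$, the monotone line search criterion in Eq. (3) 
is satisfied whenever $t^{(k)} \ge \beta(l) / (1 - \sigma)$. 

Proof Since $\mathbf{w}^{(k+1)}$ is a minimizer of problem (2), we 
have 

$$\langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w}^{(k+1)} - \mathbf{w}^{(k)} \rangle + \frac{t^{(k)}}{2} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 + r(\mathbf{w}^{(k+1)}) \le r(\mathbf{w}^{(k)}). \qquad (5)$$

It follows from assumption A1 that 

$$l(\mathbf{w}^{(k+1)}) \le l(\mathbf{w}^{(k)}) + \langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w}^{(k+1)} - \mathbf{w}^{(k)} \rangle + \frac{\beta(l)}{2} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2. \qquad (6)$$

Combining Eq. (5) and Eq. (6), we have 

$$l(\mathbf{w}^{(k+1)}) + r(\mathbf{w}^{(k+1)}) \le l(\mathbf{w}^{(k)}) + r(\mathbf{w}^{(k)}) - \frac{t^{(k)} - \beta(l)}{2} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2.$$

It follows that 

$$f(\mathbf{w}^{(k+1)}) \le f(\mathbf{w}^{(k)}) - \frac{t^{(k)} - \beta(l)}{2} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2.$$

Therefore, the line search criterion in Eq. (3) is 
satisfied whenever $(t^{(k)} - \beta(l)) / 2 \ge \sigma t^{(k)} / 2$, i.e., 
$t^{(k)} \ge \beta(l) / (1 - \sigma)$. This completes the proof of the lemma. □ 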

Next, we summarize the boundedness of $t^{(k)}$ in the 
following lemma. 

Lemma 2 For any $k \ge 0$, $t^{(k)}$ is bounded under the 
monotone line search criterion in Eq. (3). 

Proof It is trivial to show that $t^{(k)}$ is bounded from 
below, since $t^{(k)} \ge t_{\min}$ ($t_{\min}$ is defined in 
Algorithm 1). Next we prove that $t^{(k)}$ is bounded from 
above by contradiction. Assume that there exists 
a $k \ge 0$ such that $t^{(k)}$ is unbounded from above. 
Without loss of generality, we assume that $t^{(k)}$ 
increases monotonically to $+\infty$ and $t^{(k)} \ge \eta \beta(l) / (1 - \sigma)$. 
Thus, the value $t = t^{(k)} / \eta \ge \beta(l) / (1 - \sigma)$ must have 
been tried at iteration $k$ and does not satisfy the line 
search criterion in Eq. (3). But Lemma 1 states that 
$t = t^{(k)} / \eta \ge \beta(l) / (1 - \sigma)$ is guaranteed to satisfy the 
line search criterion in Eq. (3). This leads to a 
contradiction. Thus, $t^{(k)}$ is bounded from above. □ 

Remark 2 We note that if Eq. (3) holds, Eq. (4) is 
guaranteed to be satisfied. Thus, the same conclusions 
in Lemma 1 and Lemma 2 also hold under the 
non-monotone line search criterion in Eq. (4). 

Based on Lemma 1 and Lemma 2, we present our 
convergence result in the following theorem. 

Theorem 1 Let the assumptions A1-A3 hold and the 
monotone line search criterion in Eq. (3) be satisfied. 
Then all limit points of the sequence $\{\mathbf{w}^{(k)}\}$ generated 
by Algorithm 1 are critical points of problem (1). 

Proof Based on Lemma 1, the monotone line search 
criterion in Eq. (3) is satisfied and hence 

$$f(\mathbf{w}^{(k+1)}) \le f(\mathbf{w}^{(k)}), \quad \forall k \ge 0,$$

which implies that the sequence $\{f(\mathbf{w}^{(k)})\}_{k=0}^{\infty}$ is 
monotonically decreasing. Let $\mathbf{w}^*$ be a limit point 
of the sequence $\{\mathbf{w}^{(k)}\}$, that is, there exists a 
subsequence $\mathcal{K}$ such that 

$$\lim_{k \in \mathcal{K} \to \infty} \mathbf{w}^{(k)} = \mathbf{w}^*.$$

Since $f$ is bounded from below, together with the 
fact that $\{f(\mathbf{w}^{(k)})\}$ is monotonically decreasing, 
$\lim_{k \to \infty} f(\mathbf{w}^{(k)})$ exists. Observing that $f$ is 
continuous, we have 

$$\lim_{k \to \infty} f(\mathbf{w}^{(k)}) = \lim_{k \in \mathcal{K} \to \infty} f(\mathbf{w}^{(k)}) = f(\mathbf{w}^*).$$

Taking limits on both sides of Eq. (3) with $k \in \mathcal{K}$, we 
have 

$$\lim_{k \in \mathcal{K} \to \infty} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\| = 0. \qquad (7)$$

Considering that the minimizer $\mathbf{w}^{(k+1)}$ is also a critical 
point of problem (2) and $r(\mathbf{w}) = r_1(\mathbf{w}) - r_2(\mathbf{w})$, we 
have 

$$0 \in \nabla l(\mathbf{w}^{(k)}) + t^{(k)} (\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}) + \partial r_1(\mathbf{w}^{(k+1)}) - \partial r_2(\mathbf{w}^{(k+1)}).$$

Taking limits on both sides of the above equation with 
$k \in \mathcal{K}$, by considering the semi-continuity of $\partial r_1(\cdot)$ and 
$\partial r_2(\cdot)$, the boundedness of $t^{(k)}$ (based on Lemma 2) 
and Eq. (7), we obtain 

$$0 \in \nabla l(\mathbf{w}^*) + \partial r_1(\mathbf{w}^*) - \partial r_2(\mathbf{w}^*).$$

Therefore, $\mathbf{w}^*$ is a critical point of problem (1). This 
completes the proof of Theorem 1. □ 

Based on Eq. (7), we know that $\lim_{k \in \mathcal{K} \to \infty} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 = 0$ 
is a necessary optimality condition of Algorithm 1. 
Thus, $\|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2$ is a quantity to 
measure the convergence of the sequence $\{\mathbf{w}^{(k)}\}$ to 
a critical point. We present the convergence rate in 
terms of $\|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2$ in the following theorem. 

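Theorem 2 Let $\{\mathbf{w}^{(k)}\}$ be the sequence generated by 
Algorithm 1 with the monotone line search criterion in 
Eq. (3) satisfied. Then for every $n \ge 1$, we have 

$$\min_{0 \le k \le n} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 \le \frac{2 (f(\mathbf{w}^{(0)}) - f(\mathbf{w}^*))}{n \sigma t_{\min}},$$

where $\mathbf{w}^*$ is a limit point of the sequence $\{\mathbf{w}^{(k)}\}$. 

Proof Based on Eq. (3) with $t^{(k)} \ge t_{\min}$, we have 

$$\frac{\sigma t_{\min}}{2} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 \le f(\mathbf{w}^{(k)}) - f(\mathbf{w}^{(k+1)}).$$

Summing the above inequality over $k = 0, \cdots, n$, we 
obtain 

$$\frac{\sigma t_{\min}}{2} \sum_{k=0}^{n} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 \le f(\mathbf{w}^{(0)}) - f(\mathbf{w}^{(n+1)}),$$

which implies that 

$$\min_{0 \le k \le n} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 \le \frac{2 (f(\mathbf{w}^{(0)}) - f(\mathbf{w}^{(n+1)}))}{n \sigma t_{\min}} \le \frac{2 (f(\mathbf{w}^{(0)}) - f(\mathbf{w}^*))}{n \sigma t_{\min}}.$$

This completes the proof of the theorem. □ 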



Under the non-monotone line search criterion in 
Eq. (4), we have a similar convergence result in the 
following theorem (the proof uses an extension of the 
argument for Theorem 1 and is omitted). 

Theorem 3 Let the assumptions A1-A3 hold and the 
non-monotone line search criterion in Eq. (4) be 
satisfied. Then all limit points of the sequence $\{\mathbf{w}^{(k)}\}$ 
generated by Algorithm 1 are critical points of problem (1). 

Note that Theorem 1/Theorem 3 makes sense only if 
$\{\mathbf{w}^{(k)}\}$ has limit points. By considering one more mild 
assumption: 

A4 $f(\mathbf{w}) \to +\infty$ when $\|\mathbf{w}\| \to +\infty$, 

we summarize the existence of limit points in the 
following theorem (the proof is omitted): 

Theorem 4 Let the assumptions A1-A4 hold and 
the monotone/non-monotone line search criterion in 
Eq. (3)/Eq. (4) be satisfied. Then the sequence $\{\mathbf{w}^{(k)}\}$ 
generated by Algorithm 1 has at least one limit point. 



2.3.4. Discussions 

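Observe that $l(\mathbf{w}^{(k)}) + \langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{t^{(k)}}{2} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2$ 
can be viewed as an approximation of $l(\mathbf{w})$ at 
$\mathbf{w} = \mathbf{w}^{(k)}$. The GIST algorithm minimizes an 
approximate surrogate instead of the objective function in 
problem (1) at each outer iteration. We further 
observe that if $t^{(k)} \ge \beta(l) / (1 - \sigma) > \beta(l)$ [the sufficient 
condition of Eq. (3)], we obtain 

$$l(\mathbf{w}) \le l(\mathbf{w}^{(k)}) + \langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{t^{(k)}}{2} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2, \quad \forall \mathbf{w} \in \mathbb{R}^d.$$

It follows that 

$$f(\mathbf{w}) = l(\mathbf{w}) + r(\mathbf{w}) \le M(\mathbf{w}, \mathbf{w}^{(k)}), \quad \forall \mathbf{w} \in \mathbb{R}^d,$$

where $M(\mathbf{w}, \mathbf{w}^{(k)})$ denotes the objective function of 
problem (2). We can easily show that 

$$f(\mathbf{w}^{(k)}) = M(\mathbf{w}^{(k)}, \mathbf{w}^{(k)}).$$

Thus, the GIST algorithm is equivalent to solving a 
sequence of minimization problems: 

$$\mathbf{w}^{(k+1)} = \arg\min_{\mathbf{w}} M(\mathbf{w}, \mathbf{w}^{(k)}), \quad k = 0, 1, 2, \cdots$$

and can be interpreted as the well-known Majorization 
and Minimization (MM) technique (Hunter & Lange, 
2000). 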

Note that we focus on the vector case in this paper and 
the proposed GIST algorithm can be easily extended 
to the matrix case. 
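To tie the pieces of Section 2 together, the following Python sketch runs the GIST iteration on a tiny Capped $\ell_1$ regularized least squares problem. It is an illustration under stated assumptions rather than the authors' Matlab implementation: the function `gist`, the toy data `X` and `y`, the initial value `t = 1.0`, the stopping tolerance `tol` and the guard that ends the line search once `t` reaches `t_max` are all choices made here, and the acceptance test is evaluated with the value of `t` that produced the trial point.

```python
def sign(x):
    return (x > 0) - (x < 0)


def gist(loss, grad, reg, prox, w0, sigma, eta, t_min, t_max, m=1,
         max_iter=500, tol=1e-12):
    """GIST sketch; m=1 uses Eq. (3), m>1 uses Eq. (4)."""
    w, g = list(w0), grad(w0)
    f_hist = [loss(w) + reg(w)]
    t = 1.0  # assumption: no BB information exists before the first step
    for _ in range(max_iter):
        t = min(t_max, max(t_min, t))
        while True:
            # gradient step followed by the coordinate-wise proximal map
            w_new = [prox(wi - gi / t, t) for wi, gi in zip(w, g)]
            step_sq = sum((a - b) ** 2 for a, b in zip(w_new, w))
            f_new = loss(w_new) + reg(w_new)
            ok = f_new <= max(f_hist[-m:]) - 0.5 * sigma * t * step_sq
            if ok or t >= t_max:  # the t_max guard is a safety assumption
                break
            t *= eta  # shrink the step size 1/t and retry
        g_new = grad(w_new)
        # BB rule for the next outer iteration: <x, y> / <x, x>
        x = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(g_new, g)]
        xx = sum(a * a for a in x)
        if xx > 0.0:
            t = sum(a * b for a, b in zip(x, y)) / xx
        w, g = w_new, g_new
        f_hist.append(f_new)
        if step_sq <= tol:
            break
    return w, f_hist


# Toy problem (hypothetical data): least squares loss + Capped l1 penalty.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]
lam, theta = 0.1, 0.5


def residuals(w):
    return [sum(a * b for a, b in zip(row, w)) - yi for row, yi in zip(X, y)]


def loss(w):
    return sum(r * r for r in residuals(w)) / (2 * len(y))


def grad(w):
    r = residuals(w)
    return [sum(ri * row[j] for ri, row in zip(r, X)) / len(y)
            for j in range(len(w))]


def reg(w):
    return lam * sum(min(abs(wi), theta) for wi in w)


def prox(u, t):
    h = lambda v: 0.5 * (v - u) ** 2 + (lam / t) * min(abs(v), theta)
    x1 = sign(u) * max(abs(u), theta)
    x2 = sign(u) * min(theta, max(0.0, abs(u) - lam / t))
    return x1 if h(x1) <= h(x2) else x2


w_hat, f_hist = gist(loss, grad, reg, prox, [0.0, 0.0],
                     sigma=1e-5, eta=2.0, t_min=1e-30, t_max=1e30)
```

With `m=1` every accepted step satisfies Eq. (3), so the recorded objective values in `f_hist` never increase, which mirrors the majorization-minimization interpretation above.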

3. Related Work 

In this section, we discuss some related algorithms. 
One commonly used approach to solve 
problem (1) is the Multi-Stage (MS) convex relaxation 
(or CCCP, or DC programming) (Zhang, 2010b; 
Yuille & Rangarajan, 2003; Gasso et al., 2009). It 
equivalently rewrites problem (1) as 

$$\min_{\mathbf{w} \in \mathbb{R}^d} f_1(\mathbf{w}) - f_2(\mathbf{w}),$$

where $f_1(\mathbf{w})$ and $f_2(\mathbf{w})$ are both convex functions. 
The MS algorithm solves problem (1) by generating 
a sequence $\{\mathbf{w}^{(k)}\}$ as 

$$\mathbf{w}^{(k+1)} = \arg\min_{\mathbf{w}} \ f_1(\mathbf{w}) - f_2(\mathbf{w}^{(k)}) - \langle \mathbf{s}_2(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle, \qquad (8)$$

where $\mathbf{s}_2(\mathbf{w}^{(k)})$ denotes a sub-gradient of $f_2(\mathbf{w})$ at 
$\mathbf{w} = \mathbf{w}^{(k)}$. Obviously, the objective function in 
problem (8) is convex. The MS algorithm involves solving a 
sequence of convex optimization problems as in 
problem (8). In general, there is no closed-form solution 
to problem (8) and the computational cost of the MS 
algorithm is $k$ times that of solving problem (8), where 
$k$ is the number of outer iterations. This is 
computationally expensive, especially for large-scale problems. 
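For the Capped $\ell_1$ penalty of Table 1, the linearization used in problem (8) only needs a sub-gradient of the concave part. The Python sketch below assumes the split $f_1 = l + \lambda \|\mathbf{w}\|_1$ and $f_2 = \lambda \sum_i [|w_i| - \theta]_+$ taken from Table 1; the function name `capped_l1_subgradient` and the choice of the sub-gradient 0 at the kink $|w_i| = \theta$ are assumptions of this sketch.

```python
def capped_l1_subgradient(w, lam, theta):
    """One sub-gradient of f2(w) = lam * sum_i max(|w_i| - theta, 0).

    Entries beyond the cap get lam * sign(w_i); all others get 0.
    """
    return [lam * ((wi > 0) - (wi < 0)) if abs(wi) > theta else 0.0
            for wi in w]
```

Each MS stage would then pass this vector to a convex solver for problem (8), which is the step that has no closed form in general.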

A class of related algorithms called iterative shrinkage 
and thresholding (IST), which are also known 
under different names such as fixed point iteration 
and forward-backward splitting (Daubechies et al., 
2004; Combettes & Wajs, 2005; Hale et al., 2007; 
Beck & Teboulle, 2009; Wright et al., 2009; Liu et al., 
2009), have been extensively applied to solve problem 
(1). The key step is to generate a sequence $\{\mathbf{w}^{(k)}\}$ 
by solving problem (2). However, they require that 
the regularizer $r(\mathbf{w})$ is convex and some of them even 
require that both $l(\mathbf{w})$ and $r(\mathbf{w})$ are convex. Our 
proposed GIST algorithm is a more general framework, 
which can deal with a wider range of problems 
including both convex and non-convex cases. 

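Another related algorithm, called a Variant of Iterative 
Reweighted $L_\alpha$ (VIRL), was recently proposed to solve 
the following optimization problem (Lu, 2012a): 

$$\min_{\mathbf{w} \in \mathbb{R}^d} \left\{ f(\mathbf{w}) = l(\mathbf{w}) + \lambda \sum_{i=1}^{d} (|w_i|^\alpha + \epsilon_i)^{q/\alpha} \right\},$$

where $\alpha \ge 1$, $0 < q < 1$, $\epsilon_i > 0$. VIRL solves the above 
problem by generating a sequence $\{\mathbf{w}^{(k)}\}$ as 

$$\mathbf{w}^{(k+1)} = \arg\min_{\mathbf{w}} \ l(\mathbf{w}^{(k)}) + \langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{t^{(k)}}{2} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2 + \frac{\lambda q}{\alpha} \sum_{i=1}^{d} (|w_i^{(k)}|^\alpha + \epsilon_i)^{q/\alpha - 1} |w_i|^\alpha.$$

In VIRL, $t^{(k-1)}$ is chosen as the initialization of $t^{(k)}$. 
The line search step in VIRL finds the smallest integer 
$\ell$ with $t^{(k)} = t^{(k-1)} \eta^{\ell}$ ($\eta > 1$) such that 

$$f(\mathbf{w}^{(k+1)}) \le f(\mathbf{w}^{(k)}) - \frac{\sigma}{2} \|\mathbf{w}^{(k+1)} - \mathbf{w}^{(k)}\|^2 \quad (\sigma > 0).$$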

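The algorithm most closely related to our proposed GIST is 
the Sequential Convex Programming (SCP) proposed 
by Lu (2012b). SCP solves problem (1) by generating 
a sequence $\{\mathbf{w}^{(k)}\}$ as 

$$\mathbf{w}^{(k+1)} = \arg\min_{\mathbf{w}} \ l(\mathbf{w}^{(k)}) + \langle \nabla l(\mathbf{w}^{(k)}), \mathbf{w} - \mathbf{w}^{(k)} \rangle + \frac{t^{(k)}}{2} \|\mathbf{w} - \mathbf{w}^{(k)}\|^2 + r_1(\mathbf{w}) - r_2(\mathbf{w}^{(k)}) - \langle \mathbf{s}_2, \mathbf{w} - \mathbf{w}^{(k)} \rangle,$$

where $\mathbf{s}_2$ is a sub-gradient of $r_2(\mathbf{w})$ at $\mathbf{w} = \mathbf{w}^{(k)}$. Our 
algorithm differs from SCP in that the original 
regularizer $r(\mathbf{w}) = r_1(\mathbf{w}) - r_2(\mathbf{w})$ is used in the proximal 
operator in problem (2), while $r_1(\mathbf{w})$ minus a locally 
linear approximation for $r_2(\mathbf{w})$ is adopted in SCP. We 
will show in the experiments that our proposed GIST 
algorithm is more efficient than SCP. 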
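To make the difference concrete, one SCP update for the Capped $\ell_1$ penalty can be sketched as follows. With $r_1(\mathbf{w}) = \lambda \|\mathbf{w}\|_1$ and the sub-gradient $\mathbf{s}_2$ of $r_2$, completing the square in the SCP subproblem gives a plain soft-thresholding of $\mathbf{w}^{(k)} - (\nabla l(\mathbf{w}^{(k)}) - \mathbf{s}_2) / t^{(k)}$ with threshold $\lambda / t^{(k)}$. The function name `scp_step_capped_l1` and the sub-gradient chosen at $|w_i| = \theta$ are assumptions of this sketch, not code from Lu (2012b).

```python
def sign(x):
    return (x > 0) - (x < 0)


def scp_step_capped_l1(w_k, grad_k, t, lam, theta):
    """One SCP update for the Capped l1 penalty (illustrative sketch).

    w_k:    current iterate (list of floats)
    grad_k: gradient of the loss l at w_k
    t:      the value t^(k); lam, theta: penalty parameters
    """
    w_next = []
    for wi, gi in zip(w_k, grad_k):
        # sub-gradient of r2(w) = lam * max(|w| - theta, 0) at wi
        s2 = lam * sign(wi) if abs(wi) > theta else 0.0
        # minimizer of the quadratic part shifted by the linearized r2
        v = wi - (gi - s2) / t
        # soft-thresholding handles the remaining convex term lam * |w|
        w_next.append(sign(v) * max(0.0, abs(v) - lam / t))
    return w_next
```

In contrast, the GIST update applies the Capped $\ell_1$ proximal operator itself, so no linearization of $r_2$ is involved.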

4. Experiments 

4.1. Experimental Setup 

We evaluate our GIST algorithm by considering the 
Capped $\ell_1$ regularized logistic regression problem, that 
is, $l(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \mathbf{x}_i^T \mathbf{w}))$ and 
$r(\mathbf{w}) = \lambda \sum_{i=1}^{d} \min(|w_i|, \theta)$. We compare our GIST algorithm 
with the Multi-Stage (MS) algorithm and the SCP 
algorithm in different settings using twelve data sets 
summarized in Table 2. These data sets are high 
dimensional and sparse. Two of them (news20, 
real-sim)^1 have been preprocessed as two-class data sets 
(Lin et al., 2008). The other ten^2 are multi-class data 
sets. We transform the multi-class data sets into 
two-class ones by labeling the first half of all classes as the positive 
class, and the remaining classes as the negative class. 

All algorithms are implemented in Matlab and 
executed on an Intel(R) Core(TM)2 Quad CPU (Q6600 
@2.4GHz) with 8GB memory. We set $\sigma = 10^{-5}$, $m = 5$, 
$\eta = 2$, $1/t_{\min} = t_{\max} = 10^{30}$ and choose the 
starting points $\mathbf{w}^{(0)}$ of all algorithms as zero vectors. We 
terminate all algorithms if the relative change of the 
two consecutive objective function values is less than 
$10^{-5}$ or the number of iterations exceeds 1000. The 
Matlab codes of the GIST algorithm are available 
online (Gong et al., 2013). 
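The termination rule described above can be sketched as a small predicate. The function name `should_stop`, the normalization of the change by the previous objective value and the tiny constant guarding against division by zero are assumptions of this sketch; the tolerance and the iteration limit are passed in as parameters rather than fixed.

```python
def should_stop(f_prev, f_curr, iteration, tol, max_iter):
    """Stop when the relative change of two consecutive objective values
    drops below tol or the iteration count exceeds max_iter."""
    # assumption: the change is measured relative to |f_prev|
    rel_change = abs(f_prev - f_curr) / max(abs(f_prev), 1e-300)
    return rel_change < tol or iteration > max_iter
```

A rule of this kind can fire while the objective is still far from its best attainable value if progress per iteration is very slow, so early stopping by itself does not indicate a good solution.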

4.2. Experimental Evaluation and Analysis 

We report the objective function value vs. CPU 
time plots with different parameter settings in 
Figure 1. From these figures, we have the following 
observations: (1) Both GISTbb-Monotone and 
GISTbb-Nonmonotone decrease the objective function value 
rapidly and they always have the fastest convergence 
speed, which shows that adopting the BB rule 
to initialize $t^{(k)}$ indeed greatly accelerates the 
convergence speed. Moreover, both GISTbb-Monotone 
and GISTbb-Nonmonotone algorithms achieve the 
smallest objective function values. (2) GISTbb-Nonmonotone 
may give rise to an increasing objective 
function value but finally converges and has a faster 
overall convergence speed than GISTbb-Monotone in 
most cases, which indicates that the non-monotone 
line search criterion can further accelerate the 
convergence speed. (3) SCPbb-Nonmonotone is comparable 
to GISTbb-Nonmonotone in several cases; however, 
it converges much more slowly and achieves much 
larger objective function values than those of 
GISTbb-Nonmonotone in the remaining cases. This 
demonstrates the superiority of using the original 
regularizer $r(\mathbf{w}) = r_1(\mathbf{w}) - r_2(\mathbf{w})$ in the proximal operator 
in problem (2). (4) GIST-1 has a faster convergence 
speed than GIST-$t^{(k-1)}$ in most cases, which 
demonstrates that it is a bad strategy to use $t^{(k-1)}$ to 
initialize $t^{(k)}$. This is because $\{t^{(k)}\}$ increases monotonically 
in this way, making the step size $1/t^{(k)}$ monotonically 
decreasing as the algorithm proceeds. 

^1 http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/ 
^2 http://www.shi-zhong.com/software/docdata.zip 

5. Conclusions 

We propose an efficient iterative shrinkage and thresh- 
olding algorithm to solve a general class of non-convex 
optimization problems encountered in sparse learning. 
A critical step of the proposed algorithm is the com- 
putation of a proximal operator, which has a closed- 
form solution for many commonly used formulations. 
We propose to initialize the step size at each itera- 
tion using the BB rule and employ both monotone and 
non-monotone criteria as line search conditions, which 
greatly accelerate the convergence speed. Moreover, 
we provide a detailed convergence analysis of the pro- 
posed algorithm, showing that the algorithm converges 
under both monotone and non-monotone line search 
criteria. Experimental results on large-scale data sets 
demonstrate the fast convergence of the proposed 
algorithm. 

In our future work, we will focus on analyzing the the- 
oretical performance (e.g., prediction error bound, pa- 
rameter estimation error bound etc.) of the solution 
obtained by the GIST algorithm. In addition, we plan 
to apply the proposed algorithm to solve the multi- 
task feature learning problem (Gong et al., 2012a;b). 
Appendix: Solutions to Problem (2) 

Observe that $r(\mathbf{w}) = \sum_{i=1}^{d} r_i(w_i)$ and problem (2) can 
be equivalently decomposed into $d$ independent 
univariate optimization problems: 

$$w_i^{(k+1)} = \arg\min_{w_i} \ h_i(w_i) = \frac{1}{2} (w_i - u_i^{(k)})^2 + \frac{1}{t^{(k)}} r_i(w_i),$$

where $i = 1, \cdots, d$ and $u_i^{(k)}$ is the $i$-th entry of 
$\mathbf{u}^{(k)} = \mathbf{w}^{(k)} - \nabla l(\mathbf{w}^{(k)}) / t^{(k)}$. To simplify the 
notation, we unclutter the above equation by removing 
the subscripts and superscripts as follows: 

$$w^{(k+1)} = \arg\min_{w} \ h_i(w) = \frac{1}{2} (w - u)^2 + \frac{1}{t} r_i(w). \qquad (9)$$



• $\ell_1$-norm: $w^{(k+1)} = \mathrm{sign}(u) \max(0, |u| - \lambda / t)$. 

• LSP: We can obtain an optimal solution of problem (9) 
via: $w^{(k+1)} = \mathrm{sign}(u) x$, where $x$ is an 
optimal solution of the following problem: 

$$x = \arg\min_{w} \ \frac{1}{2} (w - |u|)^2 + \frac{\lambda}{t} \log(1 + w / \theta) \quad \text{s.t. } w \ge 0.$$

Noting that the objective function above is 
differentiable in the interval $[0, +\infty)$ and the minimum 
of the above problem is either a stationary point 
(the first derivative is zero) or an endpoint of the 
feasible region, we have 

$$x = \arg\min_{w \in C} \ \frac{1}{2} (w - |u|)^2 + \frac{\lambda}{t} \log(1 + w / \theta),$$

where $C$ is a set composed of 3 elements or 1 element. 
If $t^2 (|u| - \theta)^2 - 4t (\lambda - t|u|\theta) \ge 0$, 

$$C = \left\{ 0, \ \left[ \frac{t(|u| - \theta) + \sqrt{t^2 (|u| - \theta)^2 - 4t (\lambda - t|u|\theta)}}{2t} \right]_+, \ \left[ \frac{t(|u| - \theta) - \sqrt{t^2 (|u| - \theta)^2 - 4t (\lambda - t|u|\theta)}}{2t} \right]_+ \right\}.$$

Otherwise, $C = \{0\}$. 



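• SCAD: We can recast problem (9) into the 
following three problems: 

$$x_1 = \arg\min_{w} \ \frac{1}{2} (w - u)^2 + \frac{\lambda}{t} |w| \quad \text{s.t. } |w| \le \lambda,$$

$$x_2 = \arg\min_{w} \ \frac{1}{2} (w - u)^2 + \frac{-w^2 + 2\theta (\lambda / t) |w| - (\lambda / t)^2}{2(\theta - 1)} \quad \text{s.t. } \lambda \le |w| \le \theta\lambda,$$

$$x_3 = \arg\min_{w} \ \frac{1}{2} (w - u)^2 + \frac{(\theta + 1)\lambda^2}{2t^2} \quad \text{s.t. } |w| \ge \theta\lambda.$$

We can easily obtain that ($x_2$ is obtained using an 
idea similar to that for LSP by considering that $\theta > 2$): 

$$x_1 = \mathrm{sign}(u) \min(\lambda, \max(0, |u| - \lambda / t)),$$

$$x_2 = \mathrm{sign}(u) \min\left(\theta\lambda, \max\left(\lambda, \frac{t|u|(\theta - 1) - \theta\lambda}{t(\theta - 2)}\right)\right),$$

$$x_3 = \mathrm{sign}(u) \max(\theta\lambda, |u|).$$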
Table 2. Data sets statistics: n is the number of samples and d is the dimensionality of the data. 

| No. | datasets | n | d |
|---|---|---|---|
| 1 | classic | 7094 | 41681 |
| 2 | hitech | 2301 | 10080 |
| 3 | k1b | 2340 | 21839 |
| 4 | la12 | 2301 | 31472 |
| 5 | la1 | 3204 | 31472 |
| 6 | la2 | 3075 | 31472 |
| 7 | news20 | 19996 | 1355191 |
| 8 | ng3sim | 2998 | 15810 |
| 9 | ohscal | 11162 | 11465 |
| 10 | real-sim | 72309 | 20958 |
| 11 | reviews | 4069 | 18482 |
| 12 | sports | 8580 | 14866 |
Figure 1. Objective function value vs. CPU time plots. MS-Nesterov/MS-SpaRSA: The Multi-Stage algorithm using the 
Nesterov/SpaRSA method to solve problem (8); GIST-1/GIST-$t^{(k-1)}$/GISTbb-Monotone/GISTbb-Nonmonotone: The 
GIST algorithm using 1/$t^{(k-1)}$/BB rule/BB rule to initialize $t^{(k)}$ and Eq. (3)/Eq. (3)/Eq. (3)/Eq. (4) as the line search 
criterion; SCPbb-Nonmonotone: The SCP algorithm using the BB rule to initialize $t^{(k)}$ and Eq. (4) as the line search 
criterion. Note that on data sets 'hitech' and 'real-sim', MS algorithms stop early (the SCP algorithm has similar behaviors 
on data sets 'hitech' and 'news20'), because they satisfy the termination condition that the relative change of the two 
consecutive objective function values is less than $10^{-5}$. However, their objective function values are much larger than 
those of GISTbb-Monotone and GISTbb-Nonmonotone. 



Thus, we have 

$$w^{(k+1)} = \arg\min_{y} h_i(y) \quad \text{s.t. } y \in \{x_1, x_2, x_3\}.$$
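• MCP: Similar to SCAD, we can recast problem 
(9) into the following two problems: 

$$x_1 = \arg\min_{w} \ \frac{1}{2} (w - u)^2 + \frac{\lambda}{t} |w| - \frac{w^2}{2\theta} \quad \text{s.t. } |w| \le \theta\lambda,$$

$$x_2 = \arg\min_{w} \ \frac{1}{2} (w - u)^2 + \frac{\theta (\lambda / t)^2}{2} \quad \text{s.t. } |w| \ge \theta\lambda.$$

We can easily obtain that 

$$x_1 = \mathrm{sign}(u) z, \quad x_2 = \mathrm{sign}(u) \max(\theta\lambda, |u|),$$

where $z = \arg\min_{w \in C} \frac{1}{2} (w - |u|)^2 + \frac{\lambda}{t} w - \frac{w^2}{2\theta}$, 
$C = \left\{ 0, \theta\lambda, \min\left(\theta\lambda, \max\left(0, \frac{\theta (t|u| - \lambda)}{t(\theta - 1)}\right)\right) \right\}$ if $\theta - 1 \ne 0$, 
and $C = \{0, \theta\lambda\}$ otherwise. Thus, we have 

$$w^{(k+1)} = \begin{cases} x_1, & \text{if } h_i(x_1) \le h_i(x_2), \\ x_2, & \text{otherwise}. \end{cases}$$

• Capped $\ell_1$: We can recast problem (9) into the 
following two problems: 

$$x_1 = \arg\min_{w} \ \frac{1}{2} (w - u)^2 + \frac{\lambda\theta}{t} \quad \text{s.t. } |w| \ge \theta,$$

$$x_2 = \arg\min_{w} \ \frac{1}{2} (w - u)^2 + \frac{\lambda}{t} |w| \quad \text{s.t. } |w| \le \theta.$$

We can easily obtain that 

$$x_1 = \mathrm{sign}(u) \max(\theta, |u|),$$

$$x_2 = \mathrm{sign}(u) \min(\theta, \max(0, |u| - \lambda / t)).$$

Thus, we have 

$$w^{(k+1)} = \begin{cases} x_1, & \text{if } h_i(x_1) \le h_i(x_2), \\ x_2, & \text{otherwise}. \end{cases}$$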
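As an illustration of the candidate-set approach used in this Appendix, the LSP case can be sketched in Python as follows. The stationary points come from setting the derivative $w - |u| + (\lambda/t) / (\theta + w)$ to zero, which gives the quadratic $t w^2 - t(|u| - \theta) w + (\lambda - t|u|\theta) = 0$. The function name `prox_lsp` is an assumption, and the sketch clips each root to the feasible region $w \ge 0$ before comparing objective values.

```python
import math


def sign(x):
    return (x > 0) - (x < 0)


def prox_lsp(u, lam, theta, t):
    """argmin_w 0.5*(w - u)**2 + (lam/t)*log(1 + |w|/theta), theta > 0."""
    a = abs(u)

    def objective(w):
        return 0.5 * (w - a) ** 2 + (lam / t) * math.log(1.0 + w / theta)

    candidates = [0.0]  # endpoint of the feasible region w >= 0
    disc = t * t * (a - theta) ** 2 - 4.0 * t * (lam - t * a * theta)
    if disc >= 0.0:
        root = math.sqrt(disc)
        # stationary points, clipped so that they stay feasible
        candidates.append(max(0.0, (t * (a - theta) + root) / (2.0 * t)))
        candidates.append(max(0.0, (t * (a - theta) - root) / (2.0 * t)))
    x = min(candidates, key=objective)
    return sign(u) * x
```

The same pattern (enumerate a few closed-form candidates, then keep the one with the smallest objective value) underlies the SCAD, MCP and Capped $\ell_1$ cases.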